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On Weighted Low-Rank Approximation 

William RejQ 
Abstract 

Our main interest is the low-rank approximation of a matrix in R"^^" 
under a weighted Frobenius norm. This norm associates a weight to each of 
the (m X n) matrix entries. We conjecture that the number of approximations 
is at most min(m, n). 

We also investigate how the approximations depend on the weight-values. 
Keywords: Weight, Low-rank, Factorization, Missing. 

This is a short investigation concerning a best approximation of an arbitrary 
real matrix X by a "weighted" low-rank approximation WLRA according to 

X ^ A B', = WLRA 

[m X n] [m X p] [p x n] [m x n] 

with a condition on the ranks 

rank [A) = rank [B) = p and p < rank {X) . 

All matrices are real with X G M™^", A G M^'^p and B G W^'p. 

The attention is more focused on the approximations WLRA than on the 
factors A and B. The approximations under consideration minimize the weighted 
Frobenius norm 

m n 

\\X-AB' \\l,= J2 E [^^^- - ^')^.^]' = Mi^AB, (1) 
1=1 j=i 

as also defined by (1.3) of Higham (2002). 

In some sense, the minimization we face goes much further than the indefinite 
least squares problem of Bojanczyk, Higham and Patel (2003). The fact that 
both factors A and B are of concern and the allocation of a separate weight 
to each of the X-entries seriously complicate the matter. Nevertheless, this 
structure naturally arises in some contexts such as the smoothing of a data 
matrix of measurements where each of the measured entity can have its own 
limited precision. The scientists working in this domain know that minimization 
(II]) can have several solutions, a badly understood feature that can loosely be 
attributed to the lack of convexity of this norm in terms of all the entries of A 
and B, although the norm is convex in the entries of {A \ given B) and those of 
(B I given A). 

What is the possible number of best approximations WLRA = A B', under 
the weighted Frobenius norm ([1])? 

-•^Dr. Rey (widdy.rey(at)ginail.coni) is retired from the Philips Research Laboratories, Eind- 
hoven, The Netherlands. 
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This minimization is a NP-hard problem, as shown by Gilhs and Ghneur (2010). 
What exactly takes place is so little known that it may explain why many authors 
of application papers ignore the fact. Since Bradu and Gabriel (1978) introduced 
the biplot, many variations on approximating by a low-rank matrix have been 
investigated (Greenacre, 2012). 

The paper leads to a conjecture on the number of solutions; they can be 
quite many. Having started with generalities, Section 2 addresses some aspects 
of algorithmic convergence. Then, we turn our attention to a dual problem 
where we consider that the weights can vary, while the matrix entries are seen as 
arbitrary constants. At Section 4, we even imagine that the squared weights can 
become negative and wonder what can be expected from such an algebraic trick. 
This is a domain where the experience is fairly limited, although the trick gives 
insight on our problem. Sections 5 and 6 report numerical experiments that lead 
to the conjecture. Eventually, some algorithmic notes are the object of Section 
8. 

1. Generalities. 

There exist several algorithms for the evaluation of WLRA and the one 
with the lowest computational complexity is based on two weighted lin- 
ear regression steps. Starting with an arbitrary estimate of A, A^''\ it is 
improved by first evaluating the corresponding optimal B^''\ 

:\\X- A^''^ B' 11^2= MiHB, (2) 

and then estimating an improved A^^^'^\ 

-.WX-A 11^2= MiHA, (3) 

until the limit 

is sufficiently well attained. This algorithm is said to be alternating and has 
been thoroughly investigated in statistics since Gabriel and Zamir (1979); 
their method does not necessarily converge to an absolute minimum point 
(see the discussion in Section 6, page 491). 

Seeing that 

AB' ={AM) X (M-^ B') where M is full rank, (4) 

the factors A and B are defined up to the matrix scaling factor M. Hence, 
clearly, we estimate too many parameters when we work on all entries of A 
and B; the feature dramatically complicates the algorithmic test of conver- 
gence. To obviate the difficulty, several other algorithms have been inves- 
tigated ranging from brute force minimization on all entries of A (or B) to 
limiting the space search to a Grassmann product manifold, Manton et al. 
(2003). Various comparisons of the possible strategies have been reported 
and Srebro and Jaakkola (2003) deserves a special mention. Simonsson and 
Elden (2010), Yan (2010), Markovsky (2010) as well as Markovsky and Van 
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Huffel (2007) cover more specific aspects. Note that Van Huffel's school in- 
vestigates a norm that is shghtly generahzed compared to ([1]) ; their weight 
structure permits to introduce correlations between the X-entries. The re- 
cent work of Usevich and Markovsky (2012) about structured matrices is 
worthy of attention. Maronna and Yohai (2008) pay particular attention 
to algorithm initialisation. Okatani, Yoshida and Deguchi (2011) compare 
several methods where a damping factor is helping to find the global min- 
imum solution; their study is restricted to the field of computer vision. 

2. On convergence and conditioning. 

The two steps ([2]) and ([3]) lead to convergence seeing that each of them 
reduces the norm || . ||^2- This is a guarantee of convergence, although not 
of unicity of the approximation. 

The iterations nicely converge inasmuch as the implied weighted regressions 
are sufficiently well conditioned. Focusing on ([2]) , the j-th column of B' is 
evaluated by solving the equation 

2 

= Minfe^.,, for j = l,2,...,n 

or 

(X-A bjY W, (X-A bj) = Minj,, Wj = diag {w%, ...,wij) . 



1=1 



w. 



^,3 



— j fljfc bjk 



,fc=l 



Hence, the next m + n conditions arise 



Det {B' W B)^0 and Det {A' Wj A)^0, 



(5) 



where 

W' = diag {wl,...,wl) . 
Their dependence on A and B is weak, seeing (1^ . 

3. Generalisation to weights seen as variables. 

The minimization problem ([1]) has for solution WLRA = [A B') that is 
such that the derivatives in its vicinity cancel. Hence, it can also be viewed 
as a 'saddle point' as well as a 'fixed point' of the mapping A — )■ A^°°^ in 



{A B') : < 



^(oo) = = SaddlePoint = FixedPoint = Miha^b 
where 

B = B{A) : Er=i E;=i - {A B%,f = 



Mill 



B, 



Mill 



This equivalence between Min^^^, FixedPoint and SaddlePoint results from 
the convexity of || X — A B' ||^2 in the vicinity of the solution. Again, 
observe that the convexity of ([1]) is in term of product A B' but not jointly 
in both A and B, which practically means that the algorithms can only, 
if at all, guarantee convergence to a local minimum. We will shortly loose 



W. Rey, On Weighted Low-Rank Approximation, Feb. 2, 2013 @10:37 



4 



this convexity but, for the time being, let us be a httle more precise on the 
sort of derivatives we invoke when we speak of a saddle point. 

We address a function F of a matrix X; the function F{X) is in a metric 
space and varies continuously with variations of X. We expect F{X) to 
be Gateaux differentiable and, in a broadly generalized sense, we defined 
a 'saddle point 'X^ as being a point where all the directional derivatives 
vanish, namely 



d 



dX, 



■F{X) 



X=Xs 



lim 



F{Xs + h AX) - F{Xs 
h 



0, 



AX ||> 



whatever the bounded perturbation AX is. The present extension of the 
'saddle point' concept places us at some distance of the remarkable work 
of Benzi, Golub and Liesen (2005). 

At the risk of loosing the convexity, we substitute pseudo-weights to 
the non-negative wfy Then, our problem takes the form 



{A B') 



or rather 



^(oo) = ^ = FixedPoint 
where 

B = B{A) : YZi E?=i [Xi^j - {A B\,f = SaddlePoints, 
^ A(-) = A(-)(i?) : YZi E;=i [X^, - (A(-) B\,]^ = SaddlePoint^(.) 



(A B') 



' WLRA = = AB' = FixedPoint 

where 

-° • dB(°°) ^i=l ^j=l 



4(00) . d sr^m sr^n 
^ ■ dAi^) Z^i=l ^3=1 «J 



Xij - 



0. 



(6) 



The difference between the last two formulations is computational. 



4. Path following and anticipations. 

What is the possible number of best approximations WLRA 
a set of weights? 



A B', given 



With this question as an investigation direction, we first converted the 
problem to the saddle point set-up 06]). This induces us to consider the 
solutions as being functions of the pseudo-weights Zij and we wonder how 
these approximations vary when the pseudo-weights vary. 

Clearly, independently varying all the pseudo-weights greatly increases the 
number of concerned variables and the complexity of the problem. In order 
to keep it simple and to be able to 'see' how the solutions behave, we limited 
ourselves to following a "path" . This is now supported with the help of a 
numerical example to clarify the details. 

Starting with solutions on pseudo-weights, the path is in the space of 
pseudo- weights Zij and is parametrized by a parameter r. 
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For r = 0, the given set of pseudo- weights {..., Zij, ...} is of concern, 

r = =^ Zo = {...,Zij,...} = {...,wIj,...} . 
For r = 1, the unweighted case (with a unique minimum) is met with, 



r = l =^ Zi = {z,...,z} where z 



• For |t — 1| ^ 0, the pseudo-weight set Zr consists of positive and 
negative entries. Several saddle point solutions to ([6]) can be expected. 

Eventually, we follow the path from r ^ to r ^ 1, passing for r = by 
our set of interest, Zq. Such a path can be described by 

= Zo + r (Zi-Zo). (7) 

Note that more sophisticated forms could be of interest. 

Bearing our attention on the solutions of the saddle point problem ([6]) 
while following this path, 

• we know that 

r = 1 =^ a unique solution. 

• By continuity and remaining in the vicinity of the least squares ap- 
proximation, we anticipate 

|r — 1| ^ 1 =^ a unique solution. 

• However, we have no clear indication yet on what occurs further away 

|r — 1| ^ e =^ any number of solutions. 

5. Some numerical observations. 

A numerical example let see how the approximations WLRA vary as a 
function of r. 

We search for the (unique or multiple) rank-1 approximations to matrix X 
under weights (wjj) in minimization ([1]), 



^ / 6 \ ^ ^ I 2\ f 0.04 0.68 \ 
^=[12) = (^^'^■) = ^ 0.84 0.40 ) 



(8) 



and it turns out that the data set ([8]) has 2 rank-1 approximations under 
weighted Frobenius norm (fTj), 



WLRAfe r= 



5.871 0.202 
1.032 0.036 



W. Rey, On Weighted Low-Rank Approximation, Feb. 2, 2013 @10:37 



6 



and 

...TT,A .' 0-101 0.194 

WLRAc r 



1.030 1.968 

- Further on, the subscripts b and c will appear at Table 1 - These two 
approximations clearly are very different and they also are of different qual- 
ities. Defining a weighted root mean square error, Rmse, by 



^ 2 E,,<(WLRA,,-X,,; 
Rmse = — 



they respectively yield 

Rmseft = 0.8958 and Rmse^ = 0.8507. 



The set of weights ([8]) has the average 



E. . ty? . 
2J «J 



0.67837 



and the path Z^. is defined by Rule ((Tj), the set of pseudo- weights linearly 

I,.!,. . -7 jcr f 0.67837 0.67837 
passmg through the two sets Zq and Zi = 



0.67837 0.67837 

The solutions to saddle point problem 06]) vary along this path. Each ap- 
proximation can be seen as a point of a 1-dimension "Curve" parametrized 
by r in the (m x n)-dimension space of rank-p approximations. For in- 
stance, the curve corresponding to the SVD-approximation runs through 
the points 

WLRA,=o.9 = ( I'lll QQg2 ) ' 1^™^^ = 0-9000, 

WLRA^=i = ( ^'^y^ ^'^11 ] = SVD, with Rmse = 0.9011, 
\^ 1.110 0.065 J 

WLRA^=i.i = ( ^'^25 ) ^^^^ 1^™^^ = 0.9028. 



and 



This curve starts at r = —0.05227 and terminates at r = 5.19696. The 
right end point of this curve will be referred to as being a "Cut" ; there is a 
wfjir) that approximately cancels at this curve end, (^25.19697)2 1 = 0. This 
curve also passes through the worst of the two solutions, WLRAfe^T-=o- 

A thorough space search led to the finding of four different curves. They 
are briefly described at Table 1. 

This numerical example is small-size and bigger matrices are worthy of at- 
tention. We now search for the (unique or multiple) rank-2 approximations 
to the next matrix X under weights (wjj) in minimization (fTj) . 

/ 0.04 0.84 0.72 \ 
0.56 1 0.68 



X 



/ 6 4 6 \ 

2 2 9 

9 7 

VI 3 1/ 



and Zq = (wj^) 



0.12 0.40 0.52 
\ 0.60 0.48 0.32 / 



(9) 
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Table 1: Four curves and the cuts of example (!8l). 



Subscript 


r(Left end) 


r (Right end) 


Special r-values 








-1.43695 = Cut(2,2) 


a 


-8.27280 


-0.06485 


-0.06266 = Cut(i,i) 


b 


-0.18799 


0.09357 





c 


-0.05227 


5.19696 


and 1 


d 


4.07575 


5.19696 


5.19697 = Cut(2,i) 
416.500 = Cut(i^2) 



Example (I9j) has 3 solutions, namely the approximations 



/ 9.372 
2.152 
6.448 
\ 1.550 



and 



3.431 
1.704 
2.668 
0.618 

/ 



6.079 \ 
9.052 
6.754 
1.427 / 

-0.065 
2.908 
7.371 
0.104 



/ 2.114 3.974 6.095 \ 

3.424 2.112 8.486 

3.355 -0.239 7.572 

\ 0.290 2.875 1.584 / 



3.563 
2.774 
-0.743 
1.295 



6.285 \ 
8.363 
7.320 
2.433 / 



As for example ([5]), we report the main observations at Table 2. 

6. Discussion of numerical findings 

The tables must be read keeping in mind that the reported values of curve 
end-points have a limited precision. They have been identified by following 
the curves until they terminate and this procedure is somewhat coarse. 

The two cases reported for illustration are somewhat different. Example 
([8]) is so small that that the sheer appearance of a zero weight yields to a 
rank reduction such that the 'approximations' realise exact fits (at p = 1). 
This is not the case for min m, n > 1 as in Example ([9]) . 

Let us now list our main observations. When referring to given curves of 
Tables 1 and 2, we use the subscripts specified in those two Tables. 

(a) Cuts play a crucial role as end-points of curves (see a, c, d, k, I). 

(b) Most end-points are not in the vicinity of cuts. 

(c) Curves corresponding to paths ([7]) may pass through cuts (see g, i, j, m, o). 

(d) For p > 1, we observe that conditions ([5]) are usually satisfied at curve 
ends. 



(e) There are vicinities in r-values where no curve seem to exist. They 
correspond to combinations of positive and negative pseudo-weights. 
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Table 2: Curves and cuts of example ([9]). 



Subscript 


r(Left end) 


r (Right end) 


Special 


r- values 








-10.1509 


= Cut(4^i) 








-5.65039 


= Cut(2,l) 








-3.73810 


= Cut(3^3) 


e 


-3.36121 


-3.08193 












-2.67994 


= Cut(4^2) 


f 


-2.15933 


-2.10259 






g 


-1.82206 


-1.01699 












-1.54376 


= Cut(3 2) 








-0.94365 


= Cut(4 3) 


h 


-0.94702 


-0.32216 






i 


-0.24797 


0.01177 













-0.22259 


= Cut(3 1) 


j 


-0.06533 


0.27532 













-0.06461 


= Cut(i 1) 


k 


-0.06461 


2.93359 


and 1 


1 


2.55480 


2.93348 












2.93348 


= Cut(2,2) 


m 


3.75504 


8.40023 












4.64366 


= Cut(i^2) 


n 


5.07162 


6.66613 






o 


8.51944 


> 20.00 












11.8243 


= Cut(i 3) 


P 


16.0803 


> 20.00 












32.5488 


= Cut(2,3) 



(f ) The best solution at r = cannot always be attained from the SVD- 
solution at r = 1 (no curve may join these two solutions, see b and c 
of Table 1). 

(g) The number of solutions at r = is at least 1 and at most the mini- 
mum dimension of the approximated matrix. 

Table 3 reports the findings based on a large population of (X, Zq)- 
pairs. 

7. A conjecture 

Given a m x n-matrix X, it has up to Min(m,n) weighted low-rank ap- 
proximations. This is observed in the context of weighted Frobenius norm 

8. Algorithmic notes 

Two main numerical difficulties are encountered and both are approached 
by the use of "closest bases". We first introduce this concept and then 
describe the two difficulties, namely the path following procedure and the 
thorough space search. 



• Closest basis. 
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Given a set of vectors the Gram-Schmidt process is standard 

to build up a basis that is normed and orthogonal and that permits a 
perfect decomposition of these original vectors. This Gram-Schmidt 
basis is clearly not unique, seeing that different sets of such bases 
can be constructed by simply permuting the original vectors; unfor- 
tunately, this lack of uniqueness seriously complicate the numerical 
steps. 

The classical Gram-Schmidt process is now reminded by (llOp , before 
being slightly modified into (llip . 

for i := 1 to p do 

Si ■= J2j<^ (e'j e,) ej 
Ci . 62 

di . 62 

The stability can be greatly improved by constructing a basis that is 
"close" in directions to the original set ai, ...,ap. 
The "closest basis" is derived according to the next algorithm, an 
original method as far as we know. 

Repeat 



for i 


= 1 to p do 


Internal loop 1. 




:= ai/{a'i aifl'^ 


Initialisation and normalisation. 


for i 


= 1 to J9 do 


Internal loop 2. 






Half projection on all others. 


for i 


= 1 to p do 


Internal loop 3. 


ei 


:= ti - 5i 


Deflation. 


ai 


■■= ei 


Substitution. 



until sufficient convergence. 

(11) 

Algorithm ( II ip is iterative and therefore is slower than the Gram- 
Schmidt standard by ( II Op . Its convergence is quadratic and the global 
loop is little run. The extreme stability of this "closest basis" justifies 
the expense. 

• Path following. 

The path Z^. by ([7]) is with respect to the pseudo- weights and is 
associated to varying approximations 

Zr = Zq + T {Zi - Zo) ^ WLRA^ = {A B')r = Ar Bl. 

We remark that, even if {A B')^ is defined, Aj. remains indeterminate 
seeing (H]). However, if we knew A^, B^^ would immediately result 
from (Ilj), 

B,:\\X-A, B' \\l,= Mins. 

In order to follow the path ([7]), we impose a smooth variation on A^-. 
On the one hand, we restrict Ar to be orthonormed and, on the other 
hand, to be slowly varying. 



Base vector initialisation. 

(10) 



Projection on previous Cj. 



Deflation. 
Normalisation 
Substitution. 
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Given a Ar solution of ([6]) and corresponding to a given r- value, its p 
column vectors CLl , . . . , dp have size m, m > p, and we orthonormalise 
by algorithm (1111) . This is our first solution. 

Slightly modifying the first r-value and with the help of the first solu- 
tion as initialisation, we derive a new solution Ar- This is our second 
solution and it lies in a tight vicinity of the first (due to the fact that 
we used a closest basis rather than the classical Gram-Schmidt pro- 
cess). We have obtained two points of a curve. 

Further on, we apply a predictor-corrector algorithm with step length 
adaptation. Several strategies of Bates et al. (2008) are relevant. 

• Thorough space search. 

Counting the number of solutions existing for a given r-value is quite 
a problem. We resorted to applying a space search strategy. 
We start with an arbitrary initialisation and solve minimization ([1]). 
This yields a given solution WLRA,- (at r = 0). Repeating with other 
initialisations, we obtain either new WLRA^ or repeats of the already 
known solutions. 

The main trouble with the above strategy is due to possibly small 
radii of convergence. Some solutions WLRA,- can be discovered only 
when an initialisation is performed in their immediate vicinity; start- 
ing too far away, the minimizations converge toward some "dominant" 
solutions. 

Hence, it is peremptory to apply initialisations which evenly span the 
search space. 

Initialising N times, we need good starting A^-, each with m x p 
entries. These entries will be the coordinates of A^ points on a ball in 
j^mxp. eventually, each of the y4T--points is projected onto a subspace 
by orthonormalisation into a closest basis. Gross (2011) and Recht 
(2011) discuss what the sample size A^ must be. 

First, the A^ points on the ball are assigned as being the summits of 
a nearly regular polyhedron; then, they are slightly shifted as if they 
were exerting a repulsive force on the other points. We construct a 
dispersed phase of points on the ball surface. 
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Table 3: Maximum number of solutions. 



Dimensions 


Approximation 


Maximum 


m n 


rank 


number of solutions 


2 2 


1 


2 


2 3 


1 


2 


2 4 


1 


2 


2 5 


1 


2 


2 6 


1 


2 


3 2 


1 


2 


4 2 


1 


2 


5 2 


1 


2 


6 2 


1 


2 


3 3 


1 


3 


3 4 


1 


3 


3 4 


2 


3 


3 5 


1 


3 


3 5 


2 


3 


3 6 


1 


3 


4 3 


1 


3 


4 3 


2 


3 


5 3 


1 


3 


5 3 


2 


3 


6 3 


1 


3 


4 4 


2 


4 


4 5 


2 


4 


4 5 


3 


4 


4 6 


2 


4 


4 6 


3 


4 


5 4 


2 


4 


5 4 


3 


4 


6 4 


2 


4 


6 4 


3 


4 


5 6 


2 


5 


5 6 


3 


5 


5 6 


4 


5 


6 5 


2 


5 


6 5 


3 


5 


6 5 


4 


5 



